This dataset includes information about over 100,000 medical appointments of different patients from different neighborhoods in Brazil, and this dataset discuss very important point that why a person makes a doctor appointment, receives all the instructions and no-show. so I will ask questions and answer it to reach the solution for this problem.
Note: The columns that have (0 , 1) values (1 means True) (0 means False).
# Use this cell to set up import statements for all of the packages that you
# plan to use.
# Remember to include a 'magic word' so that your visualizations are plotted
# inline with the notebook. See this page for more:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
%matplotlib inline
# Load your data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()
| PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110527 entries, 0 to 110526 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PatientId 110527 non-null float64 1 AppointmentID 110527 non-null int64 2 Gender 110527 non-null object 3 ScheduledDay 110527 non-null object 4 AppointmentDay 110527 non-null object 5 Age 110527 non-null int64 6 Neighbourhood 110527 non-null object 7 Scholarship 110527 non-null int64 8 Hipertension 110527 non-null int64 9 Diabetes 110527 non-null int64 10 Alcoholism 110527 non-null int64 11 Handcap 110527 non-null int64 12 SMS_received 110527 non-null int64 13 No-show 110527 non-null object dtypes: float64(1), int64(8), object(5) memory usage: 11.8+ MB
df.shape
(110527, 14)
df.describe(include=['O'])
| Gender | ScheduledDay | AppointmentDay | Neighbourhood | No-show | |
|---|---|---|---|---|---|
| count | 110527 | 110527 | 110527 | 110527 | 110527 |
| unique | 2 | 103549 | 27 | 81 | 2 |
| top | F | 2016-05-06T07:09:54Z | 2016-06-06T00:00:00Z | JARDIM CAMBURI | No |
| freq | 71840 | 24 | 4692 | 7717 | 88208 |
df.describe()
| PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 1.474963e+14 | 5.675305e+06 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
| std | 2.560949e+14 | 7.129575e+04 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 3.921784e+04 | 5.030230e+06 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.172614e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.173184e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 9.439172e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
# make a copy of the dataframe to aviod any problem
df_new = df.copy()
df_new.dtypes
PatientId float64 AppointmentID int64 Gender object ScheduledDay object AppointmentDay object Age int64 Neighbourhood object Scholarship int64 Hipertension int64 Diabetes int64 Alcoholism int64 Handcap int64 SMS_received int64 No-show object dtype: object
1- PatienID is float64 and it needs to convert to str because I dont want when I caluclate or use describe() fun to consider it as a numeric data
2- AppointmentID is int and I will convert it to str that is because the same reason of patientid
3- ScheduledDay and AppointmentDay need to convert to a datatime because I will use them to extract a month and a day form it.
# lets convert PatienID to str
df_new['PatientId'] = df_new['PatientId'].astype(str)
df_new['PatientId'].dtypes
dtype('O')
# lets covert AppointmentID to str
df_new['AppointmentID'] = df_new['AppointmentID'].astype(str)
df_new['AppointmentID'].dtypes
dtype('O')
# convert ScheduledDay and AppointmentDay to datatime
df_new['ScheduledDay'] = pd.to_datetime(df_new['ScheduledDay'])
df_new['AppointmentDay'] = pd.to_datetime(df_new['AppointmentDay'])
df_new[['ScheduledDay','AppointmentDay']].dtypes
ScheduledDay datetime64[ns, UTC] AppointmentDay datetime64[ns, UTC] dtype: object
# see the new data type after converting
df_new.dtypes
PatientId object AppointmentID object Gender object ScheduledDay datetime64[ns, UTC] AppointmentDay datetime64[ns, UTC] Age int64 Neighbourhood object Scholarship int64 Hipertension int64 Diabetes int64 Alcoholism int64 Handcap int64 SMS_received int64 No-show object dtype: object
# first lets replace uppercase to lowercase
df_new.rename(columns=lambda x: x.strip().lower(), inplace=True)
# second lets fix and put underscore in some columns
rename_column = { 'patientid' : 'patient_id', 'appointmentid': 'appointment_id',
'scheduledday' : 'scheduled_day', 'appointmentday':'appointment_day',
'hipertension':'hypertension' ,'handcap':'handicap','no-show':'show'
}
df_new.rename(columns=rename_column,inplace=True)
df_new.head(3)
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | F | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997776694438.0 | 5642503 | M | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962299951.0 | 5642549 | F | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
First No-show column before change it to show means 'No' if the patient showed up to their appointment, and 'Yes' if they did not show up.
So after I changed the column to show I am going to change 'No' to 1 and 'Yes' to 0 to avoid any misunderstanding or misconception and to be like other columns (1 means True) and (0 means Flase).
So all the dataset now (1 = True) and (0 = False).
# change the show column values
df_new['show'].replace(['Yes', 'No'], [0,1],inplace=True)
from F and M to Female and Male
df_new['gender'].replace(['F','M'],['Female', 'Male'], inplace=True)
df_new.isnull().sum()
patient_id 0 appointment_id 0 gender 0 scheduled_day 0 appointment_day 0 age 0 neighbourhood 0 scholarship 0 hypertension 0 diabetes 0 alcoholism 0 handicap 0 sms_received 0 show 0 dtype: int64
df_new.duplicated().sum()
0
There is not any duplicated value also
def get_bar_chart(df_new):
feature = ['scholarship','hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received'] # the columns I want to plot
list(enumerate(feature))
plt.figure(figsize=(25,10))
for i in enumerate(feature):
plt.subplot(2, 3, i[0]+1)
sns.countplot(x=i[1], hue='show' ,data=df_new, palette = 'Blues')
plt.show()
get_bar_chart(df_new)
df_new.corr()
| age | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | |
|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | -0.092457 | 0.504586 | 0.292391 | 0.095811 | 0.078033 | 0.012643 | 0.060319 |
| scholarship | -0.092457 | 1.000000 | -0.019729 | -0.024894 | 0.035022 | -0.008586 | 0.001194 | -0.029135 |
| hypertension | 0.504586 | -0.019729 | 1.000000 | 0.433086 | 0.087971 | 0.080083 | -0.006267 | 0.035701 |
| diabetes | 0.292391 | -0.024894 | 0.433086 | 1.000000 | 0.018474 | 0.057530 | -0.014550 | 0.015180 |
| alcoholism | 0.095811 | 0.035022 | 0.087971 | 0.018474 | 1.000000 | 0.004648 | -0.026147 | 0.000196 |
| handicap | 0.078033 | -0.008586 | 0.080083 | 0.057530 | 0.004648 | 1.000000 | -0.024161 | 0.006076 |
| sms_received | 0.012643 | 0.001194 | -0.006267 | -0.014550 | -0.026147 | -0.024161 | 1.000000 | -0.126431 |
| show | 0.060319 | -0.029135 | 0.035701 | 0.015180 | 0.000196 | 0.006076 | -0.126431 | 1.000000 |
plt.figure(figsize=(10,5))
sns.heatmap(df_new.corr(), annot=True,linewidth=2, cmap= 'PuBuGn')
plt.title('Correlation between the dataset', fontsize=15)
plt.show()
After the two steps here what I figured out:
Hypertension and Diabetes have moderate postive correlation(0.43).
Hypertension and age have strong postive correlation(0.5).
Scholarship and show have negative correlation(-0.02). This means don't have scholarship will increase the possibility of showing up.
Alcoholism and show don't have any relationship(0.0002). This means it wouldn't affect showing up of the patient.
Sms_received and show have strong negative correlation(-0.13). This means didn't receive sms will increase the possibility of showing up.
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 17:29:31+00:00 | 2016-04-29 00:00:00+00:00 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 16:07:23+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
night_colors = ['rgb(56, 75, 126)', 'rgb(18, 36, 37)']
fig = px.pie(df_new['gender'].value_counts(),
values = 'gender',
names = ['Female', 'Male'],
color_discrete_sequence=px.colors.sequential.Darkmint
)
fig.update_traces(text = df_new['gender'].value_counts(),
textinfo = 'label+percent')
fig.update_layout(title_text = "Gender ratio", title_x = 0.5)
fig.show()
df_new.groupby(['gender','show']).count()['patient_id']
gender show
Female 0 14594
1 57246
Male 0 7725
1 30962
Name: patient_id, dtype: int64
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,6), dpi=60)
sns.countplot(data= df_new, x= 'gender', hue='show', palette= 'Blues')
plt.title('Number of female and male according to show')
plt.ylabel('Show and No-show patients count')
plt.show()
This is just an overall count of the gender according to showing up in the appoinment day or not. but to know the answer if the gender affect showing up or not I need to calculate the precentage of both of them.
# this just to get the count of gender and i use patient_id to help me count the number of female and male
gender_count = df_new.groupby('gender').count()['patient_id']
gender_count
gender Female 71840 Male 38687 Name: patient_id, dtype: int64
# to get the the count of the female and male that show up and dont show up in the appointment day.
gender_show = df_new[['gender','show']].value_counts()
gender_show
gender show Female 1 57246 Male 1 30962 Female 0 14594 Male 0 7725 dtype: int64
female_ratio = (gender_show['Female']/gender_count['Female'])*100 # this to calculate with precenatage female that showed up and dont.
label_names = ['Female showed up ', "Female didn't show up"]
explode = [0, 0.15] # to explode the part that male did not show up
colors = ['#4F6272', '#B7C3F3']
plt.pie(female_ratio, radius=1.5, shadow=True ,labels = label_names, explode=explode,colors = colors ,startangle=180,
autopct='%0.2f%%',textprops = {"fontsize":15, "fontname":"Comic Sans MS"})
plt.title("Ratio of female who showed up and who didn't show up ",fontsize=20, y=1.2, fontname='Comic Sans MS')
plt.show()
male_ratio = (gender_show['Male']/gender_count['Male'])*100 # this to calculate with precenatage male that showed up and dont.
label_names = ['Male showed up ', "Male didn't show up"]
explode = [0, 0.15] # to explode the part that male did not show up
colors = ['#4F6272', '#B7C3F3']
plt.pie(male_ratio, radius=1.5, shadow=True ,labels = label_names, explode=explode,colors=colors, startangle=180,
autopct='%0.2f%%',textprops = {"fontsize":15,"fontname":"Comic Sans MS" })
plt.title("Ratio of male who showed up and who didn't show up ",fontsize=20, y=1.2, fontname='Comic Sans MS')
plt.show()
The gender does not affect showing up of the patient, because the female and male almost equal whether showing up or not.
df_new.groupby(['gender', 'scholarship', 'show']).count()['patient_id']
gender scholarship show
Female 0 0 12488
1 50499
1 0 2106
1 6747
Male 0 0 7253
1 29426
1 0 472
1 1536
Name: patient_id, dtype: int64
genderr= df_new['gender'].replace(['Female', 'Male'], [1,0])# convert gender column to numeric values to know the relationship
# between scholarship and showing up to the appointment.
show = df_new['show']*100 # to get the precentage of show column
plt.figure(figsize=(12,6),dpi=60)
sns.barplot(x=genderr, y=show, hue='scholarship', palette= 'Blues',data=df_new)
plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the scholarship')
plt.show()
Yes, the gender with scholarship affect showing up of the patient. Becuase there is a negative correlation between scholarship and showing up. Here when the people whether male or female dont have scholarship the possibility of thier appearance or showing up increase.
df_new.groupby(['gender', 'hypertension', 'show']).count()['patient_id']
gender hypertension show
Female 0 0 11937
1 44564
1 0 2657
1 12682
Male 0 0 6610
1 25615
1 0 1115
1 5347
Name: patient_id, dtype: int64
plt.figure(figsize=(10,6),dpi=60)
sns.barplot(x=genderr, y=show, hue='hypertension',palette = 'Blues',data=df_new)
plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the hypertension')
plt.show()
Yes, the gender with hypertension affect showing up of the patient. Beacause there is a relationship or a positive correlation bewteen having hypertension and showing up. Like here I see both gender who have hypertension showing up in the appointment day more than the other who dont have hypertension.
df_new.groupby(['gender', 'diabetes', 'show']).count()['patient_id']
gender diabetes show
Female 0 0 13577
1 52657
1 0 1017
1 4589
Male 0 0 7312
1 29038
1 0 413
1 1924
Name: patient_id, dtype: int64
plt.figure(figsize=(12,6),dpi=60)
sns.barplot(x=genderr, y=show, hue='diabetes', palette = 'Blues', data=df_new)
plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the diabetes')
plt.show()
Yes, the gender with diabetes affect showing up of the patient. Beacause there is a relationship or a positive correlation bewteen having diabetes and showing up. Like here I see both gender who have diabetes showing up in the appointment day more than the other who dont have diabetes. Same as hypertension..
df_new.groupby(['gender', 'alcoholism', 'show']).count()['patient_id']
gender alcoholism show
Female 0 0 14286
1 56331
1 0 308
1 915
Male 0 0 7356
1 29194
1 0 369
1 1768
Name: patient_id, dtype: int64
plt.figure(figsize=(10,6),dpi=60)
sns.barplot(x=genderr, y=show, hue='alcoholism', palette = 'Blues', data=df_new)
plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and the alcoholism')
plt.show()
The alcoholism doesn't affect the showing up of the patient , and there isn't a strong relationship between alcoholism and showing up of the patient because:
Male: the ones who drink show up in the appointment day more than the ones who don't drink.
Female: is quite the opposit of male, the ones who don't drink show up in the appointment day more than the ones who actually drink.
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 17:29:31+00:00 | 2016-04-29 00:00:00+00:00 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 16:07:23+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
# To know the unuque values of the handicap column
df_new['handicap'].unique()
array([0, 1, 2, 3, 4], dtype=int64)
# To know the value counts of the handicap column
df_new['handicap'].value_counts()
0 108286 1 2042 2 183 3 13 4 3 Name: handicap, dtype: int64
df_new['handicap'].replace({0:0, 1:1, 2:1, 3:1, 4:1}, inplace=True)
df_new['handicap'].value_counts()
0 108286 1 2241 Name: handicap, dtype: int64
plt.figure(figsize=(14,6),dpi=60)
sns.barplot(x=genderr, y=show, hue='handicap', palette = 'Blues', data=df_new)
plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and handicap')
plt.show()
Yes, the gender with handicap affect showing up of the patient. Beacause there is a positive correlation bewteen being handicapped and showing up. Like here I see both gender who is handicapped showing up in the appointment day more than the other who isn't handicapped.
df_new['sms_received'].value_counts()
0 75045 1 35482 Name: sms_received, dtype: int64
df_new.groupby(['gender', 'sms_received', 'show']).count()['patient_id']
gender sms_received show
Female 0 0 7874
1 39764
1 0 6720
1 17482
Male 0 0 4661
1 22746
1 0 3064
1 8216
Name: patient_id, dtype: int64
plt.figure(figsize=(16,6),dpi=60)
sns.barplot(x=genderr, y=show, hue='sms_received', palette = 'Blues', data=df_new)
plt.xticks([0,1],['Male', 'Female'])
plt.ylabel('Percentage of Showing up')
plt.title('Show up of the patient based on the gender and handicap')
plt.show()
No, there isn't any relationship between receiving sms and showing up in the appointment based on the gender. Beacause both gender who didn't receive sms showing up in the appointment more than the ones who received sms. So receiving sms or not has nothing to do with showing up It isn't the problem for not showing up.
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 17:29:31+00:00 | 2016-04-29 00:00:00+00:00 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 16:07:23+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
df_new['age'].unique()
array([ 62, 56, 8, 76, 23, 39, 21, 19, 30, 29, 22, 28, 54,
15, 50, 40, 46, 4, 13, 65, 45, 51, 32, 12, 61, 38,
79, 18, 63, 64, 85, 59, 55, 71, 49, 78, 31, 58, 27,
6, 2, 11, 7, 0, 3, 1, 69, 68, 60, 67, 36, 10,
35, 20, 26, 34, 33, 16, 42, 5, 47, 17, 41, 44, 37,
24, 66, 77, 81, 70, 53, 75, 73, 52, 74, 43, 89, 57,
14, 9, 48, 83, 72, 25, 80, 87, 88, 84, 82, 90, 94,
86, 91, 98, 92, 96, 93, 95, 97, 102, 115, 100, 99, -1],
dtype=int64)
df_new['age'].value_counts()
0 3539
1 2273
52 1746
49 1652
53 1651
...
115 5
100 4
102 2
99 1
-1 1
Name: age, Length: 104, dtype: int64
df_new['age'].describe()
count 110527.000000 mean 37.088874 std 23.110205 min -1.000000 25% 18.000000 50% 37.000000 75% 55.000000 max 115.000000 Name: age, dtype: float64
df_new['age'].mode()
0 0 dtype: int64
df_new.query('age == -1')
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99832 | 465943158731293.0 | 5775010 | Female | 2016-06-06 08:58:13+00:00 | 2016-06-06 00:00:00+00:00 | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# delete the -1 age from the dataset
df_new.drop(df.index[99832], inplace=True)
# know the count of the people with age 0
df_new.query('age == 0')['age'].count()
3539
some analysis in the age column:
min: -1 it is just one row, and it doesn't make sense so i dropped it.
max: 115 its weird but it can happen so I will leave it the same.
mode: is zero its (3539) values maybe zero means babies didn't birth yet, but it doesn't make any sense so I am going to convert it to nan.
#convert the zero values into nan values.
agee = df_new['age'].replace(0,np.nan)
fig = px.box(df_new,agee, labels=dict(x = 'Age'))
fig.update_layout(title_text = 'The distribution of the age', title_x = 0.5)
fig.show()
df_new.query('age == 115')
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63912 | 31963211613981.0 | 5700278 | Female | 2016-05-16 09:17:44+00:00 | 2016-05-19 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 63915 | 31963211613981.0 | 5700279 | Female | 2016-05-16 09:17:44+00:00 | 2016-05-19 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 68127 | 31963211613981.0 | 5562812 | Female | 2016-04-08 14:29:17+00:00 | 2016-05-16 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 76284 | 31963211613981.0 | 5744037 | Female | 2016-05-30 09:44:51+00:00 | 2016-05-30 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 97666 | 748234579244724.0 | 5717451 | Female | 2016-05-19 07:57:56+00:00 | 2016-06-03 00:00:00+00:00 | 115 | SÃO JOSÉ | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
There are outliers at the age of 115 five people.
# lets first add age range column to group the age so that we can visulaize it easily.
ages = agee
bins = [1,18, 40,115]
labels = [ '1-17','18-39', '40-115']
df_new['age_range'] = pd.cut(agee, bins, labels = labels,include_lowest = True)
plt.figure(figsize=(10,6),dpi=60)
sns.countplot(x='age_range', hue='show',palette= 'crest',data= df_new)
plt.ylabel('Count of show',fontsize=15)
plt.xlabel('Age range',fontsize=15)
plt.title('Show up of the patient based on the age range',fontsize=15)
plt.show()
Yes, the age of the patient affect showing up in the appointment day. So there is a postive correlation between the age and showing up . Here when the people getting older the possibility of showing up in the appointment day increase .
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | age_range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 40-115 |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 17:29:31+00:00 | 2016-04-29 00:00:00+00:00 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1-17 |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 16:07:23+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 40-115 |
# first lets convert scheduled_day and appointment_day to date
df_new['scheduled_day'] = pd.to_datetime(df_new['scheduled_day']).dt.date
df_new['appointment_day'] = pd.to_datetime(df_new['appointment_day']).dt.date
# after that lets add the day name of the appointment
df_new['day_of_appointment'] = pd.to_datetime(df_new['appointment_day']).dt.day_name()
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | age_range | day_of_appointment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1-17 | Friday |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 40-115 | Friday |
# checking that scheduled day is before appointment day
check = df_new[df_new['scheduled_day'] > df_new['appointment_day']][['scheduled_day','appointment_day']]
check
| scheduled_day | appointment_day | |
|---|---|---|
| 27033 | 2016-05-10 | 2016-05-09 |
| 55226 | 2016-05-18 | 2016-05-17 |
| 64175 | 2016-05-05 | 2016-05-04 |
| 71533 | 2016-05-11 | 2016-05-05 |
| 72362 | 2016-05-04 | 2016-05-03 |
My doubts are true there are five rows that the appointment day is before the scheduled day and this doesn't make any sense so I will drop them.
df_new.drop(df_new.index[[27033,55226,64175,71533,72362]], inplace=True)
plt.figure(figsize=(12,5))
sns.countplot(x='day_of_appointment', data=df_new, palette = 'crest')
plt.title('Days of the appointment',fontsize=15)
plt.xlabel('Day of the week',fontsize=15)
plt.ylabel('count of the appointments',fontsize=15)
plt.show()
The days of the appointment that is full from monday to friday.
The most day the people chose to go to the appointment is wednesday.
Sunday has no appointments.
saturday is the least day of appointments.
# the count of each day of appointment
day_count = df_new.groupby('day_of_appointment')['show'].count()
day_count
day_of_appointment Friday 19019 Monday 22713 Saturday 39 Thursday 17246 Tuesday 25638 Wednesday 25866 Name: show, dtype: int64
# the number of people who show up in the appointment day
show_count = df_new.groupby('day_of_appointment')['show'].sum()
show_count
day_of_appointment Friday 14982 Monday 18024 Saturday 30 Thursday 13909 Tuesday 20488 Wednesday 20774 Name: show, dtype: int64
# the precentage of each day poeple show up in the appointment day
show_prop = df_new.groupby('day_of_appointment')['show'].mean()
show_prop
day_of_appointment Friday 0.787739 Monday 0.793554 Saturday 0.769231 Thursday 0.806506 Tuesday 0.799126 Wednesday 0.803139 Name: show, dtype: float64
fig = px.pie(df_new,
values= show_prop,
names = ['Friday', 'Monday', 'Saturday', 'Thursday', 'Tuesday', 'Wednesday'],
color_discrete_sequence=px.colors.sequential.Emrld
)
fig.update_traces(textinfo = 'label+percent')
fig.update_layout(title_text = 'Percentage the day of the week according to show up ', title_x = 0.5)
fig.show()
Thursday has the most precentage of showing up and Wednesday the second day of showing up.
# the numbers of the people who don't show up in the appointment days
no_show_count = day_count -show_count
no_show_count
day_of_appointment Friday 4037 Monday 4689 Saturday 9 Thursday 3337 Tuesday 5150 Wednesday 5092 Name: show, dtype: int64
# The precentage of each day people didn't show up in the appointment day
no_show_prop = 1- show_prop
no_show_prop
day_of_appointment Friday 0.212261 Monday 0.206446 Saturday 0.230769 Thursday 0.193494 Tuesday 0.200874 Wednesday 0.196861 Name: show, dtype: float64
fig = px.pie(df_new,
values= no_show_prop,
names = ['Friday', 'Monday', 'Saturday', 'Thursday', 'Tuesday', 'Wednesday'],
color_discrete_sequence=px.colors.sequential.Emrld
)
fig.update_traces(textinfo = 'label+percent')
fig.update_layout(title_text = "Percentage the day of the week according to no show up ", title_x = 0.5)
fig.show()
Saturday is the most day people didn't show up in the appointment day and then friday.
df_new['appointment_month'] = pd.to_datetime(df_new['appointment_day']).dt.month_name()
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | age_range | day_of_appointment | appointment_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1-17 | Friday | April |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 40-115 | Friday | April |
df_new['appointment_month'].unique()
array(['April', 'May', 'June'], dtype=object)
df_new['appointment_month'].value_counts()
May 80836 June 26450 April 3235 Name: appointment_month, dtype: int64
# the count of the month
month_count = df_new.groupby('appointment_month')['show'].count()
month_count
appointment_month April 3235 June 26450 May 80836 Name: show, dtype: int64
# the number of the people who show up in these months
month_show = df_new.groupby('appointment_month')['show'].sum()
month_show
appointment_month April 2602 June 21568 May 64037 Name: show, dtype: int64
# the precentage of each month people show up in the appointment day
show_month_prop = df_new.groupby('appointment_month')['show'].mean()
show_month_prop
appointment_month April 0.804328 June 0.815425 May 0.792184 Name: show, dtype: float64
fig = px.pie(df_new,
values= show_month_prop,
names = ['April','June','May'],
color_discrete_sequence=px.colors.sequential.Darkmint
)
fig.update_traces(textinfo = 'label+percent')
fig.update_layout(title_text = 'Month of the appointment and show up ', title_x = 0.5)
fig.show()
The ratio of three month according to show up almost the same, and we cant say the most month people show up in because there is a huge difference between the three months in values. May is 64037, June is 21568, and April is 2602.
# the number of the people who didn't show up in these months
month_no_show = month_count - month_show
month_no_show
appointment_month April 633 June 4882 May 16799 Name: show, dtype: int64
# the precentage of each month people didn't show up in the appointment day
no_show_month_prop = 1- show_month_prop
no_show_month_prop
appointment_month April 0.195672 June 0.184575 May 0.207816 Name: show, dtype: float64
fig = px.pie(df_new,
values= no_show_month_prop,
names = ['April','June','May'],
color_discrete_sequence=px.colors.sequential.Darkmint
)
fig.update_traces( textinfo = 'label+percent')
fig.update_layout(title_text = "Month of the appointment and don't show up ", title_x = 0.5)
fig.show()
we cant say the highest and the lowest ratio because there is a huge difference between the values of the three months. may is 16799, June is 4882 and April is 633.
df_new['waiting_days'] = df_new['appointment_day'] - df_new['scheduled_day']
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | age_range | day_of_appointment | appointment_month | waiting_days | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 days |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 days |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 days |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1-17 | Friday | April | 0 days |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 days |
# lets get red of days word
df_new['waiting_days'] = df_new['waiting_days'].astype('str') # to be able to use split
df_new['waiting_days'] = df_new['waiting_days'].apply(lambda x: x.split()[0])
df_new['waiting_days'] = df_new['waiting_days'].astype('int')
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | age_range | day_of_appointment | appointment_month | waiting_days | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1-17 | Friday | April | 0 |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 |
df_new['waiting_days'].unique()
array([ 0, 2, 3, 1, 4, 9, 29, 10, 23, 11, 18, 17, 14,
28, 24, 21, 15, 16, 22, 43, 30, 31, 42, 32, 56, 45,
46, 39, 37, 38, 44, 50, 60, 52, 53, 65, 67, 91, 66,
84, 78, 87, 115, 109, 63, 70, 72, 57, 58, 51, 59, 41,
49, 73, 64, 20, 33, 34, 6, 35, 36, 12, 13, 40, 47,
8, 5, 7, 25, 26, 48, 27, 19, 61, 55, 62, 176, 54,
77, 69, 83, 76, 89, 81, 103, 79, 68, 75, 85, 112, 80,
86, 98, 94, 142, 155, 162, 169, 104, 133, 125, 96, 88, 90,
151, 126, 127, 111, 119, 74, 71, 82, 108, 110, 102, 122, 101,
105, 92, 97, 93, 107, 95, 139, 132, 179, 117, 146, 123])
days = df_new['waiting_days']
bins= [-1,21,46,65,85,105,125,142,160,179]
labels = ['0-20', '21-45', '46-64', '65-84', '85-104', '105-124', '125-141','142-159', '160-179']
df_new['days_range']= pd.cut(days, bins, labels = labels,include_lowest = True)
df_new.head()
| patient_id | appointment_id | gender | scheduled_day | appointment_day | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | show | age_range | day_of_appointment | appointment_month | waiting_days | days_range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872499824296.0 | 5642903 | Female | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 | 0-20 |
| 1 | 558997776694438.0 | 5642503 | Male | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 | 0-20 |
| 2 | 4262962299951.0 | 5642549 | Female | 2016-04-29 | 2016-04-29 | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 | 0-20 |
| 3 | 867951213174.0 | 5642828 | Female | 2016-04-29 | 2016-04-29 | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1-17 | Friday | April | 0 | 0-20 |
| 4 | 8841186448183.0 | 5642494 | Female | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 40-115 | Friday | April | 0 | 0-20 |
# graph the count of the waiting days
plt.figure(figsize=(12,5))
sns.countplot(x='days_range', data=df_new, palette= 'crest')
plt.title('Waiting days between scheduled days and appointment days',fontsize=15)
plt.xlabel('Delay in days', fontsize=15)
plt.ylabel('Count of the waiting days',fontsize=15)
plt.show()
# graph the waiting days according to showing up or not
plt.figure(figsize=(12,5))
sns.barplot(x='days_range', y='show',data=df_new, palette='crest')
plt.title('Showing up or not based on waiting days',fontsize=15)
plt.ylabel('Percent of show',fontsize=15)
plt.xlabel('Delay in days ')
plt.show()
The waiting days doesn't affect showing up of the patient.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
1